Biostat 200B Homework 4
Due Feb 9 @ 11:59PM
Question A.1
How do we interpret the coefs of the interaction terms? Compare these parameter estimates to those from the separate models.
Answer:
The SAS codes are as follows:
The fitted model is \[\begin{align*} \hat{risk}&=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw+\hat{\beta}_{\text{length}}length+\\ &\quad \hat{\beta}_{\text{nclength}}nclength+\hat{\beta}_{\text{slength}}slength+\hat{\beta}_{\text{wlength}}wlength\\ &=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw+\\ &\quad (\hat{\beta}_{\text{length}}+\hat{\beta}_{\text{nclength}}regnc+\hat{\beta}_{\text{slength}}regs+\hat{\beta}_{\text{wlength}}regw)length \end{align*}\]
Coefficient for nclength = 0.30337: for each one-day increase in the average length of stay of all patients in the hospital, the estimated mean risk in the North Central region increases by 0.30337 percentage points more than it does in the North East (reference) region.
Coefficient for slength = 0.43930: for each one-day increase in the average length of stay of all patients in the hospital, the estimated mean risk in the South region increases by 0.43930 percentage points more than it does in the North East (reference) region.
Coefficient for wlength = -0.29237: for each one-day increase in the average length of stay of all patients in the hospital, the estimated mean risk in the West region increases by 0.29237 percentage points less than it does in the North East (reference) region.
In the separate model for region 2 (North Central), the coefficient for length is 0.60893, which equals the coefficient for the interaction term nclength in the original model, 0.30337, plus the coefficient for length in the original model, 0.30556.
\(\hat{\beta}_{\text{length}}' = \hat{\beta}_{\text{nclength}} + \hat{\beta}_{\text{length}} = 0.30337 + 0.30556 = 0.60893\)
In the separate model for region 3 (South), the coefficient for length is 0.74486, which equals the coefficient for the interaction term slength in the original model, 0.43930, plus the coefficient for length in the original model, 0.30556.
\(\hat{\beta}_{\text{length}}' = \hat{\beta}_{\text{slength}} + \hat{\beta}_{\text{length}} = 0.43930 + 0.30556 = 0.74486\)
In the separate model for region 4 (West), the coefficient for length is 0.01319, which equals the coefficient for the interaction term wlength in the original model, -0.29237, plus the coefficient for length in the original model, 0.30556.
\(\hat{\beta}_{\text{length}}' = \hat{\beta}_{\text{wlength}} + \hat{\beta}_{\text{length}} = -0.29237 + 0.30556 = 0.01319\)
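The three identities above can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers; the variable names are illustrative and it assumes North East is the reference region.

```python
# Coefficients reported above for the interaction model.
beta_length = 0.30556
interactions = {"North Central": 0.30337, "South": 0.43930, "West": -0.29237}

# Region-specific slope = reference slope + that region's interaction coefficient.
region_slopes = {"North East": beta_length}
for region, delta in interactions.items():
    region_slopes[region] = round(beta_length + delta, 5)

for region, slope in region_slopes.items():
    print(f"{region}: {slope:.5f}")
```

The printed slopes for North Central, South, and West should match the length coefficients from the separate region models.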
Question A.2
How would we test whether the slope coef for hospitals in the North Central region is equal to the slope coef for hospitals in the South region? Run this test.
Answer:
The SAS code and test results are as follows:
For the North Central region, the slope coef for length is \(\beta_{\text{length}}+\beta_{\text{nclength}}\), and for the South region, the slope coef for length is \(\beta_{\text{length}}+\beta_{\text{slength}}\). The two slopes are equal exactly when \(\beta_{\text{nclength}} = \beta_{\text{slength}}\), because \(\beta_{\text{length}}\) cancels. Therefore, the null hypothesis \(H_0: \beta_{\text{nclength}} = \beta_{\text{slength}}\) is tested against the alternative hypothesis \(H_1: \beta_{\text{nclength}} \neq \beta_{\text{slength}}\).
From the F-test, the p-value is \(0.5374>0.05\), so we do not reject the null hypothesis: there is insufficient evidence that the slope coef for hospitals in the North Central region differs from the slope coef for hospitals in the South region.
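To make the mechanics of this test concrete, here is a minimal NumPy sketch of an F test for a single linear contrast \(c^\top\beta = 0\). It runs on simulated stand-in data, not the homework data, and the helper name `contrast_f_stat` and the simulation settings are assumptions made for illustration only.

```python
import numpy as np

def contrast_f_stat(X, y, c):
    """F statistic (1 numerator df) for H0: c @ beta = 0 in the OLS fit of y on X."""
    n, p = X.shape
    xtx = X.T @ X
    beta_hat = np.linalg.solve(xtx, X.T @ y)        # OLS estimates
    resid = y - X @ beta_hat
    mse = resid @ resid / (n - p)                    # estimated error variance
    estimate = c @ beta_hat                          # estimated contrast
    variance = mse * (c @ np.linalg.solve(xtx, c))   # estimated variance of the contrast
    return estimate**2 / variance, beta_hat

# Simulated stand-in data (NOT the homework data): 4 regions, one continuous predictor.
rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 4, size=n)                  # 0 plays the role of the reference region
length = rng.normal(9.6, 1.9, size=n)
regnc, regs, regw = (region == 1) * 1.0, (region == 2) * 1.0, (region == 3) * 1.0
X = np.column_stack([np.ones(n), regnc, regs, regw, length,
                     regnc * length, regs * length, regw * length])
y = 1.5 + 0.3 * length + 0.3 * regnc * length + 0.3 * regs * length + rng.normal(0, 1, n)

# H0: beta_nclength - beta_slength = 0 (columns 5 and 6 of X).
c = np.array([0, 0, 0, 0, 0, 1.0, -1.0, 0])
f_stat, beta_hat = contrast_f_stat(X, y, c)
print(f"F = {f_stat:.3f} on 1 and {n - X.shape[1]} df")
```

The same helper covers a test of a single coefficient against zero by setting `c` to a unit vector, in which case the F statistic is the square of the usual t statistic.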
Question A.3
How would we test whether the slope coef for hospitals in the West region is equal to the slope coef for hospitals in the North East region? Run this test.
Answer:
The SAS code and test results are as follows:
For the North East region (the reference, i.e. regnc=regs=regw=0), the slope coef for length is \(\beta_{\text{length}}\), and for the West region, the slope coef for length is \(\beta_{\text{length}}+\beta_{\text{wlength}}\). The two slopes are equal exactly when \(\beta_{\text{wlength}} = 0\). Therefore, the null hypothesis \(H_0: \beta_{\text{wlength}} = 0\) is tested against the alternative hypothesis \(H_1: \beta_{\text{wlength}} \neq 0\).
From the F-test, the p-value is \(0.3147>0.05\), so we do not reject the null hypothesis: there is insufficient evidence that the slope coef for hospitals in the West region differs from the slope coef for hospitals in the North East region.
Question A.4
How do these regression coefficients compare to the previous ones, with length not centered? Interpret each regression coef, including the intercept.
Answer:
The SAS codes and parameter estimates for the centered model are as follows:
The coefficients for lengthc, nclengthc, slengthc, and wlengthc are the same as those for length, nclength, slength, and wlength in the uncentered model, while the intercept and the coefficients for regnc, regs, and regw change. The fitted model for the centered model is: \[\begin{align*}
\hat{risk}&=\hat{\beta_0}'+\hat{\beta}_{\text{regnc}}'regnc+\hat{\beta}_{\text{regs}}'regs+\hat{\beta}_{\text{regw}}'regw+\\
&\quad (\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)(length-\bar{length})\\
&=\hat{\beta_0}'+\hat{\beta}_{\text{regnc}}'regnc+\hat{\beta}_{\text{regs}}'regs+\hat{\beta}_{\text{regw}}'regw-\\
&\quad (\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)\bar{length}+\\
&\quad (\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)length\\
&=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw+\\
&\quad (\hat{\beta}_{\text{length}}+\hat{\beta}_{\text{nclength}}regnc+\hat{\beta}_{\text{slength}}regs+\hat{\beta}_{\text{wlength}}regw)length
\end{align*}\]
Therefore, \[\begin{align*} &\hat{\beta_0}'+\hat{\beta}_{\text{regnc}}'regnc+\hat{\beta}_{\text{regs}}'regs+\hat{\beta}_{\text{regw}}'regw-(\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\\ &\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)\bar{length}=\hat{\beta_0}+\hat{\beta}_{\text{regnc}}regnc+\hat{\beta}_{\text{regs}}regs+\hat{\beta}_{\text{regw}}regw \end{align*}\]
Also, \[\begin{align*} &(\hat{\beta}_{\text{length}}'+\hat{\beta}_{\text{nclength}}'regnc+\hat{\beta}_{\text{slength}}'regs+\hat{\beta}_{\text{wlength}}'regw)length=\\ &(\hat{\beta}_{\text{length}}+\hat{\beta}_{\text{nclength}}regnc+\hat{\beta}_{\text{slength}}regs+\hat{\beta}_{\text{wlength}}regw)length \end{align*}\]
So we have: \[\begin{align*} &\hat{\beta}_{\text{length}}'=\hat{\beta}_{\text{length}}\\ &\hat{\beta}_{\text{nclength}}'=\hat{\beta}_{\text{nclength}}\\ &\hat{\beta}_{\text{slength}}'=\hat{\beta}_{\text{slength}}\\ &\hat{\beta}_{\text{wlength}}'=\hat{\beta}_{\text{wlength}}\\ &\hat{\beta_0}'=\hat{\beta_0}+\hat{\beta}_{\text{length}}\bar{length}\\ &\hat{\beta}_{\text{regnc}}'=\hat{\beta}_{\text{regnc}}+\hat{\beta}_{\text{nclength}}\bar{length}\\ &\hat{\beta}_{\text{regs}}'=\hat{\beta}_{\text{regs}}+\hat{\beta}_{\text{slength}}\bar{length}\\ &\hat{\beta}_{\text{regw}}'=\hat{\beta}_{\text{regw}}+\hat{\beta}_{\text{wlength}}\bar{length} \end{align*}\]
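Coefficient for intercept = 4.42042 \(\approx\) 1.47235 + 0.30556 \(\times\) 9.648 and the interpretation is: when regnc, regs, regw, and lengthc are all equal to zero (so that nclengthc, slengthc, and wlengthc are zero as well), i.e., for a hospital in region 1 (North East) with the sample-mean length of stay of 9.648 days, the estimated mean risk is 4.42042 percent. Unlike the uncentered intercept, this is a meaningful quantity because lengthc = 0 corresponds to a length of stay within the observed range.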
Coefficient for lengthc = 0.30556 and the interpretation is: for each one-day increase in length, the estimated mean risk in region 1 (North East) increases by 0.30556 percentage points.
Coefficient for nclengthc = 0.30337 and the interpretation is: for each one-day increase in the average length of stay of all patients in the hospital, the estimated mean risk in the North Central region increases by 0.30337 percentage points more than it does in the North East (reference) region.
Coefficient for slengthc = 0.43930 and the interpretation is: for each one-day increase in the average length of stay of all patients in the hospital, the estimated mean risk in the South region increases by 0.43930 percentage points more than it does in the North East (reference) region.
Coefficient for wlengthc = -0.29237 and the interpretation is: for each one-day increase in the average length of stay of all patients in the hospital, the estimated mean risk in the West region increases by 0.29237 percentage points less than it does in the North East (reference) region.
Coefficient for regnc = -0.04825 \(\approx\) -2.97515 + 0.30337 \(\times\) 9.648 and the interpretation is: when lengthc is equal to zero, i.e., at the mean length of stay (9.648 days), the estimated mean risk in region 2 (North Central) is 0.04825 percentage points lower than that in region 1 (North East).
Coefficient for regs = -0.15325 \(\approx\) -4.39164 + 0.43930 \(\times\) 9.648 and the interpretation is: when lengthc is equal to zero, i.e., at the mean length of stay (9.648 days), the estimated mean risk in region 3 (South) is 0.15325 percentage points lower than that in region 1 (North East).
Coefficient for regw = -0.01893 \(\approx\) 2.80186 - 0.29237 \(\times\) 9.648 and the interpretation is: when lengthc is equal to zero, i.e., at the mean length of stay (9.648 days), the estimated mean risk in region 4 (West) is 0.01893 percentage points lower than that in region 1 (North East).
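The relationship between the uncentered and centered coefficients can be checked numerically. The sketch below plugs the reported uncentered estimates and the mean length of 9.648 into the identities derived above; the dictionary layout is just an illustrative choice, and because 9.648 is a rounded mean the results should agree with the centered output only to about four decimals.

```python
length_bar = 9.648  # sample mean of length, as used in the arithmetic above (rounded)

# Uncentered estimates quoted above: (coefficient, matching slope term on length).
uncentered = {
    "intercept": (1.47235, 0.30556),   # beta_0 with beta_length
    "regnc": (-2.97515, 0.30337),      # beta_regnc with beta_nclength
    "regs": (-4.39164, 0.43930),       # beta_regs with beta_slength
    "regw": (2.80186, -0.29237),       # beta_regw with beta_wlength
}

# Centered coefficient = uncentered coefficient + matching slope term * mean length.
centered = {name: b + slope * length_bar for name, (b, slope) in uncentered.items()}

for name, value in centered.items():
    print(f"{name}: {value:.5f}")
```

The slope and interaction coefficients need no such check, since centering leaves them unchanged.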
Question B.1
Log transformations: Fit the following models and provide the parameter estimate output (coefs, SEs, p-values), and interpret the regression coefficient associated with the predictor variable. Also, comment briefly on the model fit (residual) diagnostics as to whether the model appears to be a good fit to the data and whether the assumptions of the model are met.
- Regress log10 SpikeIgG (Y) on days PSO (X).
- Regress ln SpikeIgG (Y) on days PSO (X).
- Regress days PSO on ln(age).
Answer:
- The fitted model is as follows:
Interpretation for coefficient of days PSO: A one-day increase in days PSO multiplies the estimated geometric mean of Spike IgG by \(10^{-0.00192}=0.996\), i.e., a 0.4% decrease. From the residual plots, it seems that \(E(\epsilon_{i}) = 0\) for all \(i\), and the error variance is roughly constant across observations. In the QQ plot, the points roughly follow a straight line. So the assumptions of the normal error regression model are approximately met and the model appears to be a good fit to the data.
- The fitted model is as follows:
Interpretation for coefficient of days PSO: A one-day increase in days PSO multiplies the estimated geometric mean of Spike IgG by \(e^{-0.00442}=0.996\), i.e., a 0.4% decrease. This matches the log10 model, as expected, since the two outcomes differ only by the constant factor \(\ln 10\). From the residual plots, it seems that \(E(\epsilon_{i}) = 0\) for all \(i\), and the error variance is roughly constant across observations. In the QQ plot, the points roughly follow a straight line. So the assumptions of the normal error regression model are approximately met and the model appears to be a good fit to the data.
- The fitted model is as follows:
Interpretation for coefficient of ln(age): For every 1% increase in age, the estimated mean days PSO increases by \(16.13521 \times \ln(1.01) \approx 0.01 \times 16.13521 = 0.16\) days. From the residual plots, it seems that \(E(\epsilon_{i}) = 0\) for all \(i\), and the error variance is roughly constant across observations. In the QQ plot, the points roughly follow a straight line. So the assumptions of the normal error regression model are approximately met and the model appears to be a good fit to the data.
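The back-transformations used in the three interpretations above are easy to reproduce. The following is a small sketch using only the slope estimates quoted in the text; the variable names are illustrative.

```python
import math

b_log10 = -0.00192   # slope on days PSO in the log10(SpikeIgG) model
b_ln = -0.00442      # slope on days PSO in the ln(SpikeIgG) model
b_lnage = 16.13521   # slope on ln(age) in the days PSO model

# Log-transformed outcome: exponentiate the slope to get a multiplicative effect per day.
ratio_log10 = 10 ** b_log10
ratio_ln = math.exp(b_ln)

# Log-transformed predictor: a 1% increase in age adds b * ln(1.01) to the outcome.
change_per_1pct_age = b_lnage * math.log(1.01)

print(f"per-day ratio (log10 model): {ratio_log10:.4f}")
print(f"per-day ratio (ln model):    {ratio_ln:.4f}")
print(f"change in days PSO per 1% age increase: {change_per_1pct_age:.3f}")
```

The two per-day ratios agree because the ln slope is the log10 slope multiplied by \(\ln 10\), up to rounding.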
Question B.2
Log-log model: Using the covid_immune data, fit and interpret a log(Y)-log(X) model for the relationship between SpikeIgG and SpikeIgA. This should include doing the following:
- Produce and examine the distributions of SpikeIgG and SpikeIgA, on their original scale and after log transformation.
- Produce a scatterplot of SpikeIgG (y) versus SpikeIgA (x) with a loess smooth and the linear model fit.
- Interpret the coefficient associated with log(SpikeIgA) from the log-log model.
- Use residuals to check whether the log-log model is a reasonable fit to the data.
Answer:
The distributions of SpikeIgG and SpikeIgA on their original scale are as follows:
The distributions of SpikeIgG and SpikeIgA are right-skewed.
The distributions of log(SpikeIgG) and log(SpikeIgA) are as follows:
The distributions of log(SpikeIgG) and log(SpikeIgA) are approximately normal.
The scatterplot of SpikeIgG (y) versus SpikeIgA (x) with a loess smooth and the linear model fit is as follows:
Most of the \(x\) and \(y\) values are concentrated in a small-value region, and the linear model does not fit well on the original scale.
The fitted model is as follows:
Interpretation for coefficient of log(SpikeIgA): For every 1% increase in SpikeIgA, the estimated mean SpikeIgG will increase by approximately 0.47350%.
The residual plots are as follows:
From the residual plots, it seems that \(E(\epsilon_{i}) = 0\) for all \(i\), and the error variance is roughly constant across observations. In the QQ plot, the points roughly follow a straight line. So the assumptions of the normal error regression model are approximately met, and the log-log model is a reasonable fit to the data.
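The percent interpretation of the log(SpikeIgA) coefficient is an approximation that works well for small changes. As a rough sketch (variable names are illustrative), the exact multiplicative effect implied by the reported slope of 0.47350 can be computed for a 1% increase and, for comparison, for a doubling of SpikeIgA:

```python
b_loglog = 0.47350  # reported slope on log(SpikeIgA) in the log-log model

# In a log-log model, multiplying x by k multiplies the fitted y by k ** b.
pct_change_1pct = (1.01 ** b_loglog - 1) * 100   # effect of a 1% increase in x
pct_change_double = (2 ** b_loglog - 1) * 100    # effect of doubling x

print(f"1% increase in SpikeIgA -> {pct_change_1pct:.3f}% change in SpikeIgG")
print(f"doubling SpikeIgA       -> {pct_change_double:.1f}% change in SpikeIgG")
```

For a 1% change the exact figure is very close to the slope itself, which is why the shortcut reading of the coefficient as a percent change is reasonable here; for larger changes such as a doubling, the exact formula should be used.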
Question B.3
Interaction between a categorical and continuous variable. For this problem, we will use interactions to explore whether the rate of exponential decay of SpikeIgG depends on the individual’s peak disease severity.
- In the covid_immune dataset, I created a variable peakDiseaseSeverity that is coded as 1 = asymptomatic or mild, 2 = moderate, 3 = severe. Determine the sample sizes in each category that have data for both daysPSO and SpikeIgG.
- Fit a model with an interaction between daysPSO and peak disease severity. Use asymptomatic/mild as the reference category. The outcome variable should be log-transformed SpikeIgG.
- Conduct a joint test of whether the two interaction terms are equal to zero (this will be an F test).
- Regardless of the conclusion of the test for interaction, obtain point estimates of the half-lives of SpikeIgG for each of the 3 disease severity groups.
Answer:
The sample sizes in each category that have data for both daysPSO and SpikeIgG are as follows:
For category 1 = asymptomatic or mild, the sample size is 206. For category 2 = moderate, the sample size is 9. For category 3 = severe, the sample size is 12.
The fitted model is as follows:
The joint test of whether the two interaction terms are equal to zero is as follows:
The p-value of the joint test is \(0.5511>0.05\), so we do not reject the null hypothesis: there is insufficient evidence that either interaction term differs from zero, i.e., that the rate of decay of SpikeIgG depends on peak disease severity.
The fitted model is \[\begin{align*} log(\hat{SpikeIgG}) &= 6.78870 +1.53034 \times dismo +2.52143 \times disse - 0.00489 \times daysPSO + \\ &\quad 0.00096 \times dismo*daysPSO - 0.00734 \times disse*daysPSO \\ &= 6.78870 +1.53034 \times dismo +2.52143 \times disse +\\ &\quad(- 0.00489+ 0.00096 \times dismo- 0.00734 \times disse) * daysPSO \end{align*}\]
So when peakDiseaseSeverity is 1 = asymptomatic or mild, i.e. dismo=disse=0, the coefficient of daysPSO is -0.00489.
When peakDiseaseSeverity is 2 = moderate, i.e. dismo=1, disse=0, the coefficient of daysPSO is -0.00489+0.00096=-0.00393.
When peakDiseaseSeverity is 3 = severe, i.e. dismo=0, disse=1, the coefficient of daysPSO is -0.00489-0.00734=-0.01223.
Under exponential decay, the estimated SpikeIgG is multiplied by \(e^{\beta t}\) after \(t\) days, where \(\beta\) is the coefficient of daysPSO for the group. The half-life solves \(e^{\beta t_{1/2}}=1/2\), which gives \(t_{1/2}=\frac{\mathrm{ln}2}{-\beta}\). The point estimates of the half-lives of SpikeIgG for each of the 3 disease severity groups are as follows:
For category 1 = asymptomatic or mild, the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{0.00489}\approx 141.7\) days.
For category 2 = moderate, the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{0.00393}\approx 176.4\) days.
For category 3 = severe, the point estimate of the half-life is \(t_{1/2}=\frac{\mathrm{ln}2}{0.01223}=56.7\) days.
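The half-life calculation can be reproduced directly from the fitted coefficients. This sketch uses the rounded coefficients printed in the fitted model, so results may differ slightly from values computed with unrounded estimates; the dictionary keys are illustrative labels.

```python
import math

base_slope = -0.00489  # daysPSO coefficient for the reference group (asymptomatic/mild)
interaction = {"asymptomatic/mild": 0.0, "moderate": 0.00096, "severe": -0.00734}

# Group slope = reference slope + interaction; half-life = ln(2) / (-slope).
half_life = {
    group: math.log(2) / -(base_slope + delta)
    for group, delta in interaction.items()
}

for group, days in half_life.items():
    print(f"{group}: {days:.1f} days")
```

Given the small sample sizes in the moderate and severe groups, these point estimates should be read with caution.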